Introduction:

The COVID-19 outbreak started in December 2019 in Wuhan, China. The situation became an epidemic
following the Spring Festival in China. Over the course of the year, the virus spread globally,
including to low-income and developing countries. It killed more than eighteen hundred people and
infected over seventy thousand in its first fifty days. Reported positive cases in India had risen
to six million by September 2020. The incubation period of the infection is 2 to 14 days. Common
symptoms of COVID-19 include cough, high fever, sore throat, and breathlessness. According to World
Health Organization (WHO) report 60, community transmission in India could not be prevented, and
screening the entire population at mass gatherings is not a feasible task.

The Government of India has taken many initiatives to minimize the spread of COVID-19 in the
country. Despite these efforts, the infection rate in India is rising rapidly. Therefore, tracking
and diagnostic testing for COVID-19 are critical, ideally without requiring people to be tested for
the infection repeatedly [1]. Health workers and clinicians on the front line of such a pandemic are
at a higher risk of exposure to hazardous pathogens such as the COVID-19 virus. During this
pandemic, especially in developing countries like India, it is imperative to study COVID-19 trends
and results and to help people understand their test results using data analysis methods [2]. It is
also necessary to use the relevant information to devise plans that can help predict the outbreak of
the infection.

Artificial Intelligence (AI) has been a breakthrough of the last decade and is used in multiple
applications, including autonomous systems, prediction, and detection systems used in day-to-day
life. The current study explores various aspects of a COVID-19 prediction AI application in which
test results and parameters are analysed to tell the user whether they might have an infection,
based on the inputs [3].

AI has been applied to detecting and predicting COVID-19, and this paper describes its use in
deploying such a COVID-19 prediction model. A COVID-19 prediction model could also minimise the
errors that may creep in during manual testing. An automated prediction model translates to less
time spent per test, making testing faster. It also reduces the risk to healthcare workers. The
proposed COVID-19 prediction model, which uses a 3-way random forest, was designed to overcome the
limitations of the traditional healthcare system by using machine learning algorithms and clinical
parameters to predict the most likely outcome for a patient and to identify whether they might be
infected with COVID-19.


Literature Review: 

In a previously published article [4], the authors proposed a polynomial regression algorithm, as
a special case of linear regression, to work on dataset variables that are correlated but
non-linearly related. This method produced an accuracy of 93% and was compared against a support
vector machine (SVM) model implemented on the same COVID-19 dataset. The authors of [5] suggest a
novel automatic diagnosis pipeline for COVID-19 that leverages features from CT images, after trial
implementations with Machine Learning (ML) models such as Linear Regression, Support Vector Machine,
Gaussian Naive Bayes, k-Nearest Neighbour, and Neural Networks. However, the maximum accuracy
achieved was only 95.5%, which is not promising enough for a medical diagnosis problem like COVID-19
prediction. An accuracy of 91.9% was obtained by the method used in [6], where the model was trained
on a chest X-ray (CXR) dataset. The suggested model was a patch-based convolutional neural network,
and its performance was comparable to that of the COVID-Net model. In [7], chest X-ray images were
used to train ResNet-101 and ResNet-152, which achieved 96.1% accuracy.


Proposed Methodology: 

Dataset Description: 

The COVID-19 prediction model was developed on a dataset of 279 cases. These were randomly
extracted from patients admitted to the hospital between the end of February 2020 and mid-March
2020. The dataset includes gender, age, and values from routine blood tests. The resulting
prediction model was compared against the RT-PCR test for COVID-19 performed on a nasopharyngeal
swab. The features in the dataset are shown in Table 1 below.


Table 1: Shows the parameters obtained from the dataset used in the COVID-19 prediction model

Feature                     Data type
Gender                      Categorical
Age                         Numerical (discrete)
Leukocytes (WBC)            Numerical (continuous)
Platelets                   Numerical (continuous)
C-reactive Protein (CRP)    Numerical (continuous)
AST                         Numerical (continuous)
ALT                         Numerical (continuous)
GGT                         Numerical (continuous)
LDH                         Numerical (continuous)
Neutrophils                 Numerical (continuous)
Lymphocytes                 Numerical (continuous)
Monocytes                   Numerical (continuous)
Eosinophils                 Numerical (continuous)
Basophils                   Numerical (continuous)
Swab                        Categorical


The platelet count in COVID-19 patients was severely low, i.e., thrombocytopenia was associated
with COVID-19. Studies have also shown that an elevated level of C-reactive protein (CRP) might be
associated with COVID-19. Elevated levels of alanine aminotransferase (ALT) and aspartate
aminotransferase (AST) were reported in 16-53% of COVID-19 patients. 72% of COVID-19 patients also
showed elevated GGT levels [8]. LDH was elevated in nearly 89% of patients. Only 21% of patients
presented pathological values of white blood cells (WBCs), and 18% had a neutrophil count above the
upper limit of the normal range, while 89% of patients had a lymphocyte count below the lower limit
of the normal range. These visible changes led us to adopt them as parameters for our COVID-19
model. Typically, a large dataset is required to train an AI model. In our case, however, we used a
limited dataset and still achieved 97% accuracy.
Imputation method:

The dataset used to train the prediction model was missing values for some significant
parameters. These missing values were encoded as NaN, or as blank spaces if unknown. This flaw in
the dataset was a major drawback, because data with missing values are not accepted by the
classifiers, in particular by the estimators of the scikit-learn library.

The problem could be solved by simply removing each row or column containing a missing value, but
doing so severely degrades the performance of the classifier. As the input dataset was also limited,
discarding rows and columns was not acceptable. Imputation was therefore the best solution. Although
other imputation methods were also tried on the dataset, we found that the MICE (Multivariate
Imputation by Chained Equations) method works best compared with the KNN imputer, multi-imputer, and
single imputer, as MICE works as an iterative imputer on the dataset [9].


Fancy-imputation method using MICE: 

Missing feature values can be filled in automatically with an imputation method. The method
applied here, MICE, handles categorical variables well: the algorithm passes through the data
multiple times and iteratively optimizes the imputations in each column, one column at a time.
Hence, it is also known as an iterative imputer. The disadvantage of MICE is its execution time; the
imputation process takes longer than simpler methods.


[Figure 1 stages: incomplete data, imputed data, analysis results, pooled results]

Figure 1: Shows the imputation process [10].


MICE also has the advantage of accepting inputs of different data types, such as binary or
continuous data. It is robust, filling in missing data through iterations over predictive models in
which each variable is imputed using the other variables in the dataset. The iterative imputer is
similar to the MICE package and performs multiple imputations by being applied repeatedly to the
same dataset with a constant seed [10].
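
As a minimal sketch of the iterative imputation described above, the example below uses scikit-learn's `IterativeImputer` rather than the fancy-imputation package named in the heading; this substitution is an assumption made for illustration. The small matrix and its column meanings are hypothetical and are not taken from the paper's dataset.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (required to expose IterativeImputer)
from sklearn.impute import IterativeImputer

# Hypothetical toy matrix standing in for blood-test features (not the paper's data).
# Assumed columns: age, WBC, CRP; np.nan marks a missing measurement.
X = np.array([
    [45.0, 6.1, 12.0],
    [61.0, np.nan, 88.5],
    [38.0, 7.4, np.nan],
    [70.0, 4.2, 140.0],
    [52.0, 5.0, 35.0],
])

# Each column with missing values is modelled from the other columns in turn,
# for up to max_iter rounds; a fixed random_state keeps the run reproducible.
imputer = IterativeImputer(max_iter=10, random_state=0)
X_imputed = imputer.fit_transform(X)

# Observed entries are left unchanged; only the NaN entries are filled in.
print(np.isnan(X_imputed).sum())
```

Only the missing entries are estimated, so no rows or columns of the limited dataset have to be discarded.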
Train-Test split:

The performance of the classification model was estimated with a train-test split procedure, a
standard way of estimating the performance of machine learning algorithms. The dataset was split
into training data, used to train the classifier, and testing data, used to check the performance of
the model. This is a quick and easy way to evaluate a predictive modelling problem with a limited
dataset [11].

The dataset was divided into two subsets. The first subset, called the training dataset, was used
to fit the model. The second subset, called the test dataset, was used to make predictions but was
not used to train the model [12].

Split Configuration: 

• Train and Test dataset size. 

• Split percentage varies depending upon the dataset (trial and error method) 

• Computational cost in training the model. 

• Computational cost in evaluating performance of the model. 

• Commonly the dataset is split as: 

o Train: 80%, Test: 20% 
o Train: 60%, Test: 40% 
o Train: 50%, Test: 50% 

The most suitable split for the dataset under consideration was 80-20, because the available data
were limited, whereas training an efficient model requires as many training cases as possible.
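
The 80-20 split can be sketched with scikit-learn's `train_test_split`. The feature matrix and labels below are synthetic placeholders with the same shape as the dataset described in Table 1 (279 cases, 14 features plus the swab label). The `random_state` and `stratify` settings are assumptions, as the paper does not state them.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the dataset: 279 cases, 14 features (values are random, not real).
rng = np.random.default_rng(0)
X = rng.normal(size=(279, 14))
y = rng.integers(0, 2, size=279)  # synthetic swab result: 1 = positive, 0 = negative

# 80% of the cases are used for training and 20% are held out for testing.
# stratify=y keeps the class proportions similar in both subsets (an assumption here).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0, stratify=y
)

print(X_train.shape, X_test.shape)
```

With 279 cases, this configuration yields 223 training cases and 56 test cases.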

Classifier selection for prediction model: 

After implementing various classifiers, including KNN, Decision Tree, Random Forest, SVM,
Logistic Regression, and Gaussian, we observed that the Random Forest classifier works best,
providing an accuracy of 91.071%. The main reason for this accuracy is that Random Forest is an
ensemble method: it combines the predictions of several estimators, which improves its robustness.

Random Forest is a classifier available in the scikit-learn package. It fits a number of decision
tree classifiers on samples drawn from the input dataset and combines their outputs to enhance
prediction accuracy and avoid over-fitting. The key property of the random forest is the low
correlation between its models: an amalgam of uncorrelated models gives better predictive accuracy
than any individual model. A major advantage of Random Forest is therefore its use of many tree
models, which prevents the error of an individual model from affecting the overall accuracy [13].
Table 2: Input Parameters of Random Forest classifier

Parameters        Input value
random_state      0
n_estimators      200
warm_start        True
max_depth         None (explores the whole tree)
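
A minimal sketch of a scikit-learn `RandomForestClassifier` configured with the values in Table 2 is given below. The training data are generated synthetically with `make_classification` purely for illustration, so the printed accuracy bears no relation to the figures reported in this paper.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic placeholder data with the same shape as the dataset (279 cases, 14 features).
X, y = make_classification(n_samples=279, n_features=14, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0
)

# Parameter values taken from Table 2.
clf = RandomForestClassifier(
    n_estimators=200,   # number of decision trees in the ensemble
    random_state=0,     # fixed seed for reproducibility
    warm_start=True,    # reuse already fitted trees if fit() is called again with more estimators
    max_depth=None,     # grow each tree fully
)
clf.fit(X_train, y_train)

accuracy = clf.score(X_test, y_test)  # fraction of correctly classified test cases
print(f"Test accuracy on synthetic data: {accuracy:.3f}")
```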


Improvement of the Random Forest Model:

We also considered a modification of the Random Forest algorithm, called the three-way Random
Forest classifier (TWRF), which allows the model to abstain on instances for which it has low
confidence; in doing so, a TWRF achieves higher accuracy on the instances it does classify, at the
expense of coverage (i.e., the number of instances on which it makes a prediction). We chose to
consider this class of models because they can provide more reliable predictions in a large share of
cases while exposing the uncertainty in the remaining cases, for which further (and more expensive)
tests can be suggested. From a technical point of view, since Random Forest is a probability-scoring
classifier, the model assigns each instance a probability score for every possible class. The
abstention is performed based on two thresholds α, β ∈ [0, 1]. If we denote the positive class by 1
and the negative class by 0, then each instance is classified as positive if score(1) > α and
score(1) > score(0), as negative if score(0) > β and score(0) > score(1), and otherwise the model
abstains [14].
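
The decision rule above can be sketched in a few lines. The function name `three_way_decision` and the example scores are hypothetical, and the sketch assumes a binary problem in which score(0) = 1 - score(1), as is the case for the class probabilities returned by a random forest.

```python
def three_way_decision(score_pos, alpha, beta):
    """Return 1 (positive), 0 (negative), or None (abstain) for one instance.

    score_pos is the probability score for the positive class; the score for
    the negative class is assumed to be 1 - score_pos (binary problem).
    """
    score_neg = 1.0 - score_pos
    if score_pos > alpha and score_pos > score_neg:
        return 1
    if score_neg > beta and score_neg > score_pos:
        return 0
    return None  # neither class is confident enough, so the model abstains


# Hypothetical probability scores for four instances.
scores = [0.95, 0.60, 0.10, 0.50]
decisions = [three_way_decision(s, alpha=0.80, beta=0.80) for s in scores]
print(decisions)  # [1, None, 0, None]
```

In practice the scores would come from the positive-class column of the classifier's predicted probabilities, and the abstained instances would be referred for further testing.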


Result: 

Figures 2 and 3 below indicate the decision threshold selection for the alpha and beta values of
the 3-way random forest on the ROC curve. The ROC curve indicates where the classification
thresholds lie. The 'No skill' curve is the linear baseline used for comparison with the random
forest classifier, which is denoted by 'Random Forest'.




Figure 2: Shows the ROC curve (true positive rate) of the 3-way random forest compared with the
linear no-skill curve.
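
A hedged sketch of how the points of such an ROC curve and its no-skill reference line can be computed with scikit-learn follows. The labels and scores are invented for illustration and are not the values behind Figure 2.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical true labels and positive-class scores for eight test instances.
y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])
y_score = np.array([0.10, 0.40, 0.35, 0.80, 0.20, 0.90, 0.65, 0.55])

# roc_curve sweeps the decision threshold and returns the false positive rate
# and true positive rate obtained at each candidate threshold.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
auc = roc_auc_score(y_true, y_score)

# The no-skill reference is the diagonal from (0, 0) to (1, 1).
no_skill_fpr, no_skill_tpr = [0.0, 1.0], [0.0, 1.0]

for f, t, th in zip(fpr, tpr, thresholds):
    print(f"threshold={th:.2f}  FPR={f:.2f}  TPR={t:.2f}")
print(f"AUC = {auc:.3f}")
```

Inspecting the listed thresholds alongside their false positive rates is one way to pick candidate values for alpha and beta.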
The improved random forest algorithm gave an accuracy of 97.619% when alpha and beta were chosen
well. However, a lower value of alpha raises the possibility of a higher false-positive rate in the
model. Hence, values of 0.80 and above were considered for alpha and beta. Care must therefore be
taken when selecting alpha and beta, as false positives occur frequently if the thresholds are
poorly chosen. Choosing the thresholds based on the ROC curve thus gives better accuracy and
minimizes faulty outcome predictions in the model.
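
Because a three-way classifier abstains on some instances, its accuracy is computed only over the instances it actually classifies, and it is reported together with its coverage. The sketch below illustrates this; the helper name `accuracy_and_coverage` and the example decisions and labels are hypothetical.

```python
def accuracy_and_coverage(decisions, labels):
    """Accuracy over classified instances and the fraction of instances classified.

    decisions holds 1, 0, or None (abstention); labels holds the true classes.
    """
    decided = [(d, y) for d, y in zip(decisions, labels) if d is not None]
    coverage = len(decided) / len(labels)
    if not decided:
        return float("nan"), coverage
    accuracy = sum(d == y for d, y in decided) / len(decided)
    return accuracy, coverage


# Hypothetical three-way decisions and true labels for eight instances.
decisions = [1, None, 0, 1, None, 0, 0, 1]
labels = [1, 1, 0, 0, 0, 0, 0, 1]

acc, cov = accuracy_and_coverage(decisions, labels)
print(f"accuracy on classified instances: {acc:.3f}, coverage: {cov:.2f}")
```

Raising alpha and beta generally lowers coverage while raising accuracy on the remaining instances. As a purely arithmetical illustration, 41 correct predictions out of 42 classified instances would correspond to 97.619%; the underlying counts are not reported here.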



Figure 3: Shows the precision obtained using the Proposed Predictor method

Table 3: Comparison of classifiers applied to dataset under consideration

Classifiers                                Accuracy
KNN (K-Nearest Neighbors)                  82.14%
Decision Tree                              85.71%
SVM (Support Vector Machine)               83.29%
Logistic Regression                        53.57%
Gaussian                                   80.35%
LGBM (Light Gradient Boosting Machine)     75%
Random Forest                              91%
Proposed method                            97.619%
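
A comparison of this kind can be sketched as a loop over scikit-learn classifiers evaluated on the same split. The data below are synthetic, all hyperparameters other than those of the random forest are library defaults or assumptions, "Gaussian" is assumed to mean Gaussian Naive Bayes, and LGBM is omitted because it requires a separate package. The printed accuracies will therefore differ from Table 3.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic placeholder data (279 cases, 14 features), split 80-20.
X, y = make_classification(n_samples=279, n_features=14, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0
)

classifiers = {
    "KNN": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "SVM": SVC(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Gaussian NB": GaussianNB(),  # assumed meaning of "Gaussian"
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=0),
}

# Fit every classifier on the same training data and score it on the same test data.
results = {
    name: clf.fit(X_train, y_train).score(X_test, y_test)
    for name, clf in classifiers.items()
}

for name, acc in sorted(results.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:20s} {acc:.2%}")
```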


Discussion: 

The pandemic has underscored the premium on speed and on regulators' acceptance of innovative
trial strategies that use a quantitative decision-making framework to support regulatory approval.
These mechanisms, together with the use of real-world data on external controls and adaptive
clinical designs, are not new and are increasingly well understood and accepted. They should be
leveraged, and both benefits and risks must be considered where appropriate. These predictive
proposals were not developed without challenges. The challenges include the technological solutions
needed to account for issues like privacy, security, and platform stability; likewise, providing
accurate details from the patient data is an absolute necessity [15]. Nevertheless, these challenges
can be addressed and overcome. Considering the significant benefit of these predictive models and
bringing to patients

Our model was not trained on an extensive dataset, but it achieves a high accuracy of 97.619%,
making it a commendable COVID-19 testing model. Compared with similar models, which report
accuracies of 96.1% in [7] and 93% in [4], the model is highly accurate even though it has not been
tested on a large dataset. The use of machine learning in this model also removes errors that may
occur when the test is performed manually. It also reduces the risk of COVID-19 infection for
healthcare workers. Because the process is automated, the time per test is reduced, making it
possible to perform more tests. The only limitation is the limited dataset that was used, although
the precision of the model compensates for it.

Conclusion: 

The dynamics of the COVID-19 disease profile continue to evolve rapidly. It is important to
understand the clinical impact of screening for COVID-19, especially for asymptomatic patients. As
more and more suspected COVID-19 cases arise, the chance of a shortage of the RT-PCR kits that are
primarily used to detect the virus will also increase. The amount of relevant patient data available
is huge, and gathering this information and combining it with the predictive model can be
challenging [16]. Predictive models could help users and could predict and forecast the epidemic
among the population [17].

This predictive model could act as a tool that enables researchers to develop further similar
solutions, combining other parameters and subjective health data to provide better patient
outcomes. The predictive model is accurate and reduces the time per test, the risk of transmission,
and human error.